Unsupervised Phoneme Segmentation Using Transformed Cepstrum Features
Authors
Abstract
One of the basic problems in speech engineering is phoneme segmentation: dividing a speech stream into a string of phonemes. Automatic Speech Recognition (ASR) models often require reliable phoneme segmentation in their initial training phase, and Text-to-Speech (TTS) systems need a large speech database with correct phoneme segmentation information to improve performance.

Human speech is a smoothly changing continuous signal. Unlike written language, speech signals do not include explicit segmentation marks, such as spaces. Moreover, abrupt changes rarely occur in speech signals because of the temporal constraints of vocal tract motions. The difficulty of phoneme segmentation comes from co-articulation of speech sounds, where the acoustic realization of one phoneme may blend or fuse with that of its adjacent sounds. This phenomenon can even extend across a distance of two or more phonemes. All these facts make automatic phoneme segmentation a challenging problem.

Previous approaches to phoneme segmentation fall into two categories: supervised and unsupervised segmentation. In the first case, both the linguistic contents and the acoustic models of the phonemes are available, so the segmentation problem reduces to aligning the speech signal with a string of acoustic models. Perhaps the best-known approach in this category is HMM-based forced alignment [2]. Methods in the second category try to perform phonetic segmentation without using any prior knowledge of linguistic contents or acoustic models. The approach of this paper belongs to the second category.

Unsupervised segmentation is similar to the way infants acquire speech [11]. Infants do not have acoustic or linguistic models for segmentation. However, psychological findings indicate that infants become able to segment speech according to acoustic differences between speech sounds and to cluster speech segments into categories [8]. It is only by this procedure that infants can gradually construct the speech models of their native language. Most previous approaches to this problem focus on detecting the change points of speech
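The change-point idea mentioned above can be sketched in a few lines. This is an illustrative example under simple assumptions, not the paper's algorithm: frames are assumed to be MFCC-like cepstral feature vectors, and a boundary is hypothesized wherever the Euclidean distance between consecutive frames exceeds a threshold.

```python
import math

def frame_distances(frames):
    """Euclidean distance between each pair of consecutive feature frames."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
        for f1, f2 in zip(frames, frames[1:])
    ]

def boundary_candidates(frames, threshold):
    """Frame indices where spectral change exceeds the threshold:
    candidate phoneme boundaries."""
    return [
        i + 1
        for i, dist in enumerate(frame_distances(frames))
        if dist > threshold
    ]

# Toy input: two steady regions with one abrupt change between them.
frames = [[0.0, 0.0]] * 3 + [[5.0, 5.0]] * 3
print(boundary_candidates(frames, threshold=1.0))  # → [3]
```

Real speech rarely offers such clean jumps, which is precisely why the abstract stresses co-articulation: thresholding raw frame distances over- and under-segments, motivating the optimization-based objectives discussed below.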
Similar resources
Metric learning for unsupervised phoneme segmentation
Unsupervised phoneme segmentation aims at dividing a speech stream into phonemes without using any prior knowledge of linguistic contents or acoustic models. In [1], we formulated this problem in an optimization framework and developed an objective function, the summation of squared error (SSE), based on the Euclidean distance of cepstral features. However, it is unknown whether or not Euclidean...
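As a rough illustration of the kind of objective this abstract describes (not the paper's implementation), an SSE score for a candidate segmentation might sum, over each segment, the squared deviation of its frames from the segment centroid; minimizing it over boundary placements favors internally homogeneous segments.

```python
def segment_sse(frames, boundaries):
    """Summation of squared error: squared Euclidean distance of each
    frame to its segment's centroid, summed over all segments.
    `boundaries` lists the start indices of segments after the first."""
    edges = [0] + list(boundaries) + [len(frames)]
    total = 0.0
    for start, end in zip(edges, edges[1:]):
        seg = frames[start:end]
        dim = len(seg[0])
        mean = [sum(f[d] for f in seg) / len(seg) for d in range(dim)]
        total += sum(
            (f[d] - mean[d]) ** 2 for f in seg for d in range(dim)
        )
    return total

# A boundary at the true change point yields zero error;
# a misplaced boundary yields a strictly larger SSE.
frames = [[0.0], [0.0], [4.0], [4.0]]
print(segment_sse(frames, [2]))  # → 0.0
print(segment_sse(frames, [1]) > segment_sse(frames, [2]))  # → True
```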
Unsupervised Phoneme Segmentation Using Mahalanobis Distance
Abstract One of the fundamental problems in speech engineering is phoneme segmentation. Approaches to phoneme segmentation can be divided into two categories: supervised and unsupervised segmentation. The approach of this paper belongs to the second category, which tries to perform phonetic segmentation without using any prior knowledge of linguistic contents or acoustic models. In an earlier wor...
Applying Independent Component Analysis for Speech Feature Detection
An approach to speech feature detection is developed that uses the technique of independent component analysis for blind (unsupervised-learning) detection of basis vectors in the Fourier space. This kind of feature could replace the Mel-Frequency Cepstral Coefficient (MFCC) features widely used today for phoneme-based speech recognition. Alternatively, the ICA components could act as basi...
Speech/Non-Speech Segmentation Based on Phoneme Recognition Features
This work assesses different approaches for speech and non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, where speech and non-speech classes were usually modeled by several models, we develop a representation where ...
Sparse auto-associative neural networks: theory and application to speech recognition
This paper introduces the sparse auto-associative neural network (SAANN) in which the internal hidden layer output is forced to be sparse. This is achieved by adding a sparse regularization term to the original reconstruction error cost function, and updating the parameters of the network to minimize the overall cost. We show applicability of this network to phoneme recognition by extracting sp...
Publication date: 2008